For our experiments we use the corpora provided in the \emph{SRLOnly} track of the shared task. Our MLN is tested on the following languages: Catalan and Spanish
\citep{catalan-and-spanish-data}, Chinese \citep{chinese-data}, Czech
\citep{czech-data},\footnote{For training we use only sentences shorter than 40
words in this corpus.} English \citep{english-data}, German
\citep{german-data}, and Japanese \citep{japanese-data}.

Table \ref{tbl:results} presents the F1 scores and training/test times for the development and
in-domain test corpora. Our model performs best on English, in part because the original model was developed for that language.

To put these results into context: our SRL system is the third best in the \emph{SRLOnly}
track of the shared task, and the sixth best across both the \emph{Joint}
and \emph{SRLOnly} tracks. For five of the languages the difference to the F1 score
of the best system is at most $3\%$; however, for
German it is $6.19\%$ and for Czech $10.76\%$. One possible explanation for the poor performance on Czech is given below. Note that, in comparison, our system does somewhat better in terms of precision than in terms of recall (we have the fifth best average precision but only the eighth best average
recall).
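For reference, the F1 scores reported here are assumed to follow the standard shared-task evaluation, i.e., the harmonic mean of precision and recall:

\begin{equation*}
% F1 as the harmonic mean of precision (P) and recall (R);
% the standard definition used in SRL evaluation.
F_1 = \frac{2 \cdot P \cdot R}{P + R}
\end{equation*}

Because the harmonic mean is dominated by the smaller of the two values, a system with relatively high precision (such as ours) can still trail in F1 when its recall lags behind.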


\begin{table}

\begin{center}
\small
\begin{tabular}{|l|c|c|c|c|}\hline
Language           & Devel        & Test       & Train & Test  \\
                   & F1           & F1         & time  & time \\\hline
%role score a
Average         & $77.25\%$    & $77.46\%$  & 11h 29m & 23m   \\\hline
Catalan         & $78.10\%$    & $78.00\%$  & 6h 11m  & 14m   \\ %0.740
%53060/1862=28.5, 390302 13200 29.5683, 37508,
Chinese         & $77.97\%$    & $77.73\%$  & 36h 30m & 34m   \\ %0.700
%73153/2556=28.6, 609060 22277 27.3403, 102832,
Czech           & $75.98\%$    & $75.75\%$  & 14h 21m & 1h 7m \\ %0.733
%70348/4213=16.7, 652544 38727 16.8498, 421281,
English         & $82.28\%$    & $83.34\%$  & 12h 26m & 16m   \\ %0.770
%57676/2399=24.0, 958167 39279 24.3939, 192766,
German          & $72.05\%$    & $73.52\%$  & 2h 28m  & 7m    \\ %0.712
%31622/2000=15.8, 648677 36020 18.0088, 18661,
Japanese        & $76.34\%$    & $76.00\%$  & 2h 17m  & 4m    \\ %0.609
%13615/500=27.2, 112555 4393 25.6214, 26605,
Spanish         & $78.03\%$    & $77.91\%$  & 6h 9m   & 16m   \\ %0.744
%50630/1725=29.4, 427442 14329 29.8306, 44371,
\hline
\end{tabular}
\caption{F1 scores on the in-domain development and test corpora for each language, with training and test times.}
\label{tbl:results}
\normalsize
\end{center}
\end{table}

Table \ref{tbl:outresults} presents the F1 scores of our system on the out-of-domain
test corpora. We observe a similar tendency: our system is the sixth best
for both the \emph{Joint} and \emph{SRLOnly} tracks. We also observe similarly
large differences between our scores and the best scores for German and Czech
(i.e., $>7.5\%$), while for English the difference is relatively small (i.e., $<3\%$).

\begin{table}
\begin{center}
\small
\begin{tabular}{|l|l|l|l|}\hline
Language        & Czech & English & German\\\hline\hline
F-score           & $77.34\%$ & $71.86\%$ & $62.37\%$  \\
\hline
\end{tabular}
\caption{F1 scores on the out-of-domain test corpora.}
\label{tbl:outresults}
\normalsize
\end{center}
\end{table}

Finally, we evaluated the effect of the \emph{argument siblings} set of formulae
introduced for the Japanese MLN. Without this set the F1 score on the Japanese
test set drops to $69.52\%$. Hence the \emph{argument siblings} formulae improve performance by more than $6\%$ ($76.00\%$ vs.\ $69.52\%$).






